This is the first assignment of the Introduction to Open Data Science 2018 course, covering Git / RStudio / R Markdown exercises.
library(data.table)
# Read in learning2014 created with create_learning2014.R
learning2014<-fread("learning2014.txt")
# Number of records and number of variables in the dataset
dim(learning2014)
## [1] 166 7
# Variable types (int = integer variables, num = continuous numeric variables)
str(learning2014)
## Classes 'data.table' and 'data.frame': 166 obs. of 7 variables:
## $ Gender : int 2 1 2 1 1 2 1 2 1 2 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, ".internal.selfref")=<externalptr>
Gender: Male = 1, Female = 2
Age: Age (in years) derived from the date of birth
Attitude: Global attitude toward statistics
Deep: Deep approach
Stra: Strategic approach
Surf: Surface approach
Points: Total points (max of all)
summary(learning2014)
## Gender Age Attitude Deep
## Min. :1.000 Min. :17.00 Min. :14.00 Min. :1.583
## 1st Qu.:1.000 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Median :2.000 Median :22.00 Median :32.00 Median :3.667
## Mean :1.663 Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:2.000 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :2.000 Max. :55.00 Max. :50.00 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) hist(learning2014[[i]], breaks=20, xlab=NA,main=names(learning2014)[i]))
par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) plot(density(learning2014[[i]]), xlab=NA,main=names(learning2014)[i]))
Age has a long right tail. Attitude, Deep, Stra, and Surf could be considered approximately normally distributed. The Surf and Stra histograms clearly show that these are sum variables with multiple modes (peaks in the histogram).
pairs(learning2014)
Visually, at least Attitude seems to have a relationship with Points. To investigate this further, let’s also draw a correlation plot.
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(learning2014, method="spearman"), type="lower")
Attitude clearly correlates with Points. Furthermore, Age, Gender, Stra, and Surf all show small correlations. Given that we are allowed to choose only three variables for this exercise, Age and Gender together with Attitude seem the most interesting combination.
Let’s first try a model with Age and Gender (basic demographic covariates) and Attitude as the third variable.
summary(lm(Points~Age+I(as.factor(Gender))*Attitude, data=learning2014))
##
## Call:
## lm(formula = Points ~ Age + I(as.factor(Gender)) * Attitude,
## data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3457 -3.3677 0.1811 3.8529 10.2281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.21591 4.18555 2.680 0.008135 **
## Age -0.07516 0.05379 -1.397 0.164246
## I(as.factor(Gender))2 2.84438 4.45715 0.638 0.524274
## Attitude 0.41480 0.11115 3.732 0.000263 ***
## I(as.factor(Gender))2:Attitude -0.07583 0.13155 -0.576 0.565115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.326 on 161 degrees of freedom
## Multiple R-squared: 0.2034, Adjusted R-squared: 0.1836
## F-statistic: 10.28 on 4 and 161 DF, p-value: 1.944e-07
Apparently Age and Gender have no statistically significant effect in the model, so let’s keep only Attitude.
summary(lm(Points~Attitude, data=learning2014))
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As Attitude has no natural unit, let’s standardize it (mean = 0, sd = 1) for easier interpretation.
summary(lm(Points~I(scale(Attitude)), data=learning2014))
##
## Call:
## lm(formula = Points ~ I(scale(Attitude)), data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.7169 0.4129 55.019 < 2e-16 ***
## I(scale(Attitude)) 2.5733 0.4141 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now we can say that Attitude is highly significant in this model (p = 4e-9). Its effect (beta) is 2.5733, meaning that a one standard deviation increase in Attitude increases Points by approximately 2.57.
Multiple R-squared is 0.1906, which means that Attitude alone explains about 19% of the variability in Points.
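As a quick sanity check of that interpretation (a sketch on synthetic data, not the course data): in a single-predictor linear model, Multiple R-squared is exactly the squared Pearson correlation between predictor and response.

```r
# Sketch on simulated data: R-squared of a one-predictor lm equals cor(x, y)^2
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
r2 <- summary(lm(y ~ x))$r.squared
stopifnot(isTRUE(all.equal(r2, cor(x, y)^2)))
```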
plot(lm(Points~Attitude, data=learning2014))
The residuals do not show any non-linear patterns. The Q-Q plot indicates that the residuals are approximately normally distributed, which further supports that the linear model assumptions are met. The scale-location plot shows no apparent heteroscedasticity, although cases 145, 56, and 36 could be investigated further as interesting potential outliers. However, none of these cases (or any others) fall beyond Cook’s distance in the fourth plot. If they did, it would mean they should probably be excluded from the analysis as outliers that influence the model too much.
In summary, we can conclude that the assumptions of the linear model are very likely met.
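The same conclusion can be checked numerically: cooks.distance() returns the values behind the fourth diagnostic plot. A minimal sketch on synthetic, well-behaved data (not the learning2014 model):

```r
# Sketch: inspect Cook's distances numerically instead of only visually
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
cd <- cooks.distance(lm(y ~ x))
# with clean data, no observation approaches the common cut-off of 1
max(cd) < 1
```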
library(data.table)
# Read in the alc dataset created in the data wrangling exercise
alc<-fread("data/alc.txt")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This “Student Performance Data Set” is derived from the UCI Machine Learning Repository. It is described as:
“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”
We joined these two datasets (mat and por) together; overlapping variables were either averaged (quantitative) or taken from the math dataset (qualitative). The alc_use and high_use variables were calculated from the data.
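The join rule described above can be sketched as follows; the column names (absences, paid) are illustrative stand-ins, not the exact wrangling script:

```r
# Hedged sketch of the join: average overlapping quantitative columns,
# keep the math-course value for overlapping qualitative columns
mat <- data.frame(id = 1:3, absences = c(2, 4, 6), paid = c("yes", "no", "no"))
por <- data.frame(id = 1:3, absences = c(4, 4, 2), paid = c("no", "no", "yes"))
joined <- merge(mat, por, by = "id", suffixes = c(".mat", ".por"))
joined$absences <- rowMeans(joined[, c("absences.mat", "absences.por")])
joined$paid <- as.character(joined$paid.mat)  # qualitative: take the math answer
```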
Let’s select four variables for the analysis:
vars<-c("age","sex","Medu","goout")
str(alc[,vars,with=F])
## Classes 'data.table' and 'data.frame': 382 obs. of 4 variables:
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ sex : chr "F" "F" "F" "F" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ goout: int 4 3 2 2 2 2 4 4 2 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Age is a continuous variable, gender is dichotomous, mother’s education is multinomial, and going out is ordinal (Likert scale).
library(ggplot2)
library(GGally)
#ggpairs(alc, columns=c(vars,"high_use"))
library(ggridges)
library(magrittr)
library(dplyr)
ggplot(data=alc, aes(x=high_use, y=age)) + geom_boxplot()
# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_age=mean(age), sd_age=sd(age))
## # A tibble: 2 x 3
## high_use mean_age sd_age
## <lgl> <dbl> <dbl>
## 1 FALSE 16.5 1.15
## 2 TRUE 16.8 1.21
# Univariate non-parametric test
wilcox.test(age~high_use, data=alc)
##
## Wilcoxon rank sum test with continuity correction
##
## data: age by high_use
## W = 13266, p-value = 0.05185
## alternative hypothesis: true location shift is not equal to 0
The distribution seems somewhat skewed, hence wilcox.test (Mann-Whitney). It indicates an association, although not a very strong one: older students are likely to have higher consumption, which makes intuitive sense.
ggplot(data=alc, aes(x=sex, fill=high_use)) + geom_bar()
table(alc$high_use, alc$sex)
##
## F M
## FALSE 157 113
## TRUE 41 71
chisq.test(x=alc$high_use, y=alc$sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: alc$high_use and alc$sex
## X-squared = 13.863, df = 1, p-value = 0.0001967
Gender shows a strong univariate association, as anticipated.
ggplot(data=alc, aes(x=Medu, fill=high_use)) + geom_bar()
table(alc$high_use, alc$Medu)
##
## 0 1 2 3 4
## FALSE 1 33 80 59 97
## TRUE 2 18 18 36 38
chisq.test(y=alc$high_use, x=as.factor(alc$Medu))
## Warning in chisq.test(y = alc$high_use, x = as.factor(alc$Medu)): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: as.factor(alc$Medu) and alc$high_use
## X-squared = 12.031, df = 4, p-value = 0.01713
Mother’s education shows a slight association, as we expected a priori.
ggplot(data=alc, aes(x=high_use, y=goout)) + geom_boxplot()
# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_goout=mean(goout), sd_goout=sd(goout))
## # A tibble: 2 x 3
## high_use mean_goout sd_goout
## <lgl> <dbl> <dbl>
## 1 FALSE 2.86 1.04
## 2 TRUE 3.73 1.10
# Univariate non-parametric test
wilcox.test(goout~high_use, data=alc)
##
## Wilcoxon rank sum test with continuity correction
##
## data: goout by high_use
## W = 8577, p-value = 5.88e-12
## alternative hypothesis: true location shift is not equal to 0
“Going out” indicates a strong univariate association, which also makes intuitive sense.
# Let's use Harrell's regression modeling strategies (rms) package
library(rms)
# Make sure Medu is a factor and rescale the Likert-scale "going out" variable (centered at zero: very high = 1, very low = -1)
alc[,Medu := as.factor(Medu)]
alc[,goout := (goout - 3)/2]
lrm(high_use~age+sex+Medu+goout, data=alc)
## Logistic Regression Model
##
## lrm(formula = high_use ~ age + sex + Medu + goout, data = alc)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 382 LR chi2 78.20 R2 0.264 C 0.760
## FALSE 270 d.f. 7 g 1.301 Dxy 0.519
## TRUE 112 Pr(> chi2) <0.0001 gr 3.672 gamma 0.523
## max |deriv| 6e-12 gp 0.226 tau-a 0.216
## Brier 0.163
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -0.2993 2.2897 -0.13 0.8960
## age 0.0703 0.1094 0.64 0.5204
## sex=M 0.9357 0.2551 3.67 0.0002
## Medu=1 -2.0598 1.2909 -1.60 0.1106
## Medu=2 -3.1142 1.2888 -2.42 0.0157
## Medu=3 -2.0764 1.2741 -1.63 0.1032
## Medu=4 -2.6082 1.2745 -2.05 0.0407
## goout 1.5668 0.2468 6.35 <0.0001
##
confint.default(lrm(high_use~age+sex+Medu+goout, data=alc))
## 2.5 % 97.5 %
## Intercept -4.7870567 4.1883960
## age -0.1441360 0.2847951
## sex=M 0.4356889 1.4357720
## Medu=1 -4.5898916 0.4702347
## Medu=2 -5.6401508 -0.5881813
## Medu=3 -4.5736251 0.4209207
## Medu=4 -5.1061906 -0.1101853
## goout 1.0831240 2.0504201
ggcoef(glm(high_use~age+sex+Medu+goout, data=alc, family="binomial"), exponentiate = F, errorbar_height=.2, color="blue", sort="ascending")
The results seem quite interesting (p-values, coefficients, and 95% CIs shown above).
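For reference, odds ratios and Wald 95% CIs are obtained by exponentiating the logistic coefficients and their confidence limits; a minimal sketch on simulated data (not the alc model):

```r
# Sketch: OR = exp(coef), CI = exp(Wald interval), for a logistic regression
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)
or <- exp(coef(fit))               # odds ratios
ci <- exp(confint.default(fit))    # Wald 95% confidence intervals
```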
# Remove age from the model (it was not statistically significant)
mod2<-lrm(high_use~sex+Medu+goout, data=alc, x=T, y=T)
# Get predictions and threshold the linear predictor at zero
preds<-as.numeric(predict(mod2)>0)
# Confusion matrix
cmat<-table(preds, as.numeric(alc$high_use))
cmat
##
## preds 0 1
## 0 254 68
## 1 16 44
# Training error rate (proportion of incorrectly classified individuals)
((cmat[1,2]+cmat[2,1])/sum(cmat))
## [1] 0.2198953
The error rate (22%) is better than random guessing (50%). However, this figure is likely rather optimistic, as it is an in-sample error rate.
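A fairer yardstick than a 50% coin flip is the majority-class baseline; using the class counts from the confusion matrix above (270 non-high users, 112 high users), always predicting FALSE gives roughly a 29% error rate.

```r
# Majority-class baseline error rate (counts from the data above)
n_false <- 270  # high_use == FALSE
n_true  <- 112  # high_use == TRUE
baseline_error <- min(n_true, n_false) / (n_true + n_false)
round(baseline_error, 3)  # about 0.293
```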
We can do cross-validation natively with the rms package.
validate(mod2, method="crossvalidation", B=10)
## index.orig training test optimism index.corrected n
## Dxy 0.5188 0.5188 0.4588 0.0600 0.4588 10
## R2 0.2625 0.2643 0.2410 0.0233 0.2392 10
## Intercept 0.0000 0.0000 -0.0677 0.0677 -0.0677 10
## Slope 1.0000 1.0000 0.9827 0.0173 0.9827 10
## Emax 0.0000 0.0000 0.0181 0.0181 0.0181 10
## D 0.2010 0.2025 0.1657 0.0368 0.1642 10
## U -0.0052 -0.0058 0.0538 -0.0596 0.0544 10
## Q 0.2062 0.2083 0.1118 0.0964 0.1098 10
## B 0.1634 0.1630 0.1715 -0.0085 0.1718 10
## g 1.2971 1.3068 1.2473 0.0595 1.2377 10
## gp 0.2252 0.2253 0.1886 0.0367 0.1885 10
This reveals the optimism in the original model without cross-validation (e.g. in Somers’ D).
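The corrected column is simple arithmetic: optimism is the training-minus-test gap, subtracted from the original index. Checking this against the Dxy row of the output above:

```r
# validate()'s correction, reproduced for the Dxy row shown above
index_orig <- 0.5188
training   <- 0.5188
test       <- 0.4588
optimism   <- training - test            # training-minus-test gap
corrected  <- index_orig - optimism      # optimism-corrected index
stopifnot(isTRUE(all.equal(corrected, 0.4588)))
```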
We can also include all variables in the model and perform backwards stepwise elimination with the rms package:
# Add all variables to the model (excluding those that were used to create the outcome variable)
mod3<-lrm(high_use~., data=alc[,-c("alc_use","Dalc","Walc"),with=F],x=T, y=T)
validate(mod3, method="crossvalidation", B=10, bw=T, pr=F)
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## Mjob 4.57 4 0.3339 4.57 4 0.3339 -3.43
## Fjob 4.59 4 0.3316 9.17 8 0.3284 -6.83
## internet 0.00 1 0.9707 9.17 9 0.4219 -8.83
## schoolsup 0.01 1 0.9086 9.18 10 0.5150 -10.82
## G3 0.03 1 0.8710 9.21 11 0.6027 -12.79
## higher 0.05 1 0.8278 9.26 12 0.6810 -14.74
## Pstatus 0.05 1 0.8156 9.31 13 0.7492 -16.69
## age 0.08 1 0.7816 9.39 14 0.8055 -18.61
## failures 0.18 1 0.6746 9.56 15 0.8463 -20.44
## Fedu 0.21 1 0.6490 9.77 16 0.8784 -22.23
## famsup 0.37 1 0.5408 10.14 17 0.8975 -23.86
## reason 4.56 3 0.2071 14.70 20 0.7931 -25.30
## school 0.31 1 0.5795 15.01 21 0.8224 -26.99
## romantic 0.46 1 0.4994 15.47 22 0.8415 -28.53
## nursery 0.50 1 0.4781 15.97 23 0.8566 -30.03
## freetime 0.74 1 0.3884 16.71 24 0.8606 -31.29
## famsize 0.99 1 0.3186 17.71 25 0.8545 -32.29
## guardian 2.99 2 0.2247 20.69 27 0.8004 -33.31
## health 1.66 1 0.1975 22.35 28 0.7646 -33.65
## traveltime 2.10 1 0.1469 24.46 29 0.7060 -33.54
## Medu 8.39 4 0.0783 32.85 33 0.4746 -33.15
## paid 1.82 1 0.1768 34.67 34 0.4357 -33.33
## studytime 2.71 1 0.0995 37.39 35 0.3600 -32.61
## G2 3.14 1 0.0762 40.53 36 0.2772 -31.47
## G1 0.68 1 0.4103 41.21 37 0.2916 -32.79
## address 3.76 1 0.0524 44.97 38 0.2030 -31.03
## activities 3.88 1 0.0490 48.85 39 0.1341 -29.15
## famrel 5.79 1 0.0161 54.64 40 0.0613 -25.36
## sex 6.38 1 0.0115 61.02 41 0.0228 -20.98
## absences 6.59 1 0.0102 67.61 42 0.0074 -16.39
## goout 15.15 1 0.0001 82.76 43 0.0003 -3.24
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## [1,] -0.6665 0.1432 -4.653 3.27e-06
##
## Factors in Final Model
##
## None
## index.orig training test optimism index.corrected n
## Dxy 0.0000 0.0000 0.0000 0.0000 0.0000 10
## R2 0.0000 0.0000 0.0000 0.0000 0.0000 10
## Intercept 0.0000 0.0000 -0.9021 0.9021 -0.9021 10
## Slope 1.0000 1.0000 1.0000 0.0000 1.0000 10
## Emax 0.0000 0.0000 0.2218 0.2218 0.2218 10
## D -0.0026 -0.0029 -0.0263 0.0234 -0.0260 10
## U -0.0052 -0.0058 -0.0317 0.0258 -0.0311 10
## Q 0.0026 0.0029 0.0053 -0.0024 0.0050 10
## B 0.2072 0.2072 0.2076 -0.0004 0.2076 10
## g 0.0000 0.0000 0.0000 0.0000 0.0000 10
## gp 0.0000 0.0000 0.0000 0.0000 0.0000 10
##
## Factors Retained in Backwards Elimination
##
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason nursery
## internet guardian traveltime studytime failures schoolsup famsup paid
## activities higher romantic famrel freetime goout health absences G1 G2 G3
##
## Frequencies of Numbers of Factors Retained
##
## 0
## 10
set.seed(12345)
library(MASS)
library(dplyr)
data(Boston)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
library(GGally)
library(corrplot)
corrplot(cor(Boston, method="spearman"), method = "circle")
newBost<-Boston
newBost$chas<-as.factor(newBost$chas)
newBost$rad<-as.factor(newBost$rad)
ggpairs(newBost)
Many variables seem to have non-normal distributions; many even look heavy-tailed, and a few are bimodal (having two modes, i.e. peaks, such as indus, tax, and rad). Strong positive and negative correlations are also present among several variables.
Let’s plot the density of each variable separately:
par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(Boston), function(i) plot(density(as.numeric(Boston[,i])), main=colnames(Boston)[i]))
Next, we normalize (scale) these variables to have zero mean and a standard deviation of one. This can be somewhat questionable for e.g. heavy-tailed distributions, where the variance and standard deviation may not be well defined, but it is requested by the assignment instructions.
Let’s scale and plot the new variables (the mean has been moved to zero and the standard deviation scaled to one):
# Scale original variables
newBost<-scale(Boston)  # scale() standardizes every column (all Boston columns are numeric)
# Plot scaled variables
par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(newBost), function(i) plot(density(newBost[,i]), main=colnames(newBost)[i]))
newBost<-as.data.frame(newBost)
Create a factor from the quantiles of crim (replacing the original variable). Let’s also do an 80:20 split of the dataset by creating a holdout indicator.
newBost$crim<-with(newBost, cut(crim, quantile(crim, c(0, 0.25, 0.5, 0.75, 1)), labels = c("Q1", "Q2", "Q3", "Q4"), include.lowest = TRUE)) # include.lowest keeps the minimum value in Q1 instead of producing an NA
holdout<-rbinom(n = nrow(newBost), size = 1, prob = 0.2)
print(table(holdout))
## holdout
## 0 1
## 393 113
trainset<-newBost[holdout==0,]
testset<-newBost[holdout==1,]
Estimate the model on the training set, predict on the test set (after setting aside the original crim values), and compare the predictions to the originals. Cross-tabulate predictions against the original values.
library(MASS)
mod1<-lda(crim~., data=trainset)
test_crim<-testset$crim
testset$crim<-predict(mod1, newdata=testset)$class
table(test_crim, testset$crim)
##
## test_crim Q1 Q2 Q3 Q4
## Q1 13 17 3 0
## Q2 3 13 10 0
## Q3 0 5 25 0
## Q4 0 0 0 24
cat(paste("Test error rate:", round(mean(test_crim != testset$crim)*100, 1), "%"))
## Test error rate: 33.6 %
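For context (a sketch, not the Boston data): with four roughly equal-sized classes, uniform random guessing would misclassify about 75% of cases, so 33.6% is a clear improvement.

```r
# Simulated random-guessing baseline for a balanced four-class problem
set.seed(1)
truth <- sample(paste0("Q", 1:4), 1000, replace = TRUE)
guess <- sample(paste0("Q", 1:4), 1000, replace = TRUE)
mean(truth != guess)  # close to 0.75
```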
Reload Boston and rescale. Then compute the total within-cluster sum of squares to find the right number of clusters:
# Reload Boston and rescale
newBost<-scale(Boston)  # standardize all columns again
wss <- sapply(1:20,
function(k){kmeans(newBost, k, nstart=50, iter.max = 15)$tot.withinss})
wss
## [1] 7070.000 4576.681 3871.374 3445.946 3074.612 2723.217 2436.911
## [8] 2248.218 2064.030 1940.011 1849.925 1729.998 1659.495 1598.493
## [15] 1544.558 1476.380 1426.614 1385.056 1333.214 1286.531
plot(1:20, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
This approach doesn’t seem to give any clear insight (there is no obvious elbow). Let’s try the gap statistic:
library(cluster)
cg<-clusGap(newBost, kmeans, 10, B = 100)
plot(cg)
This approach suggests nine clusters (although an argument could be made for going significantly higher). Let’s build a k-means clustering model with nine clusters and plot it:
km1<-kmeans(newBost, 9)
ggpairs(as.data.frame(newBost), aes(col=as.factor(km1$cluster)))
Read in the data and visualize it with ggpairs and Spearman correlations.
library(data.table)
library(GGally)
library(corrplot)
library(dplyr)
human<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", row.names=1, header=T, sep=",")
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human)
corrplot(cor(human, method="spearman"))
All variables seem to be unimodal, and most have skewed, non-normal distributions, some with heavy tails. Potentially only Edu.Exp could be approximated with a normal distribution. Correlations among Edu2.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, and Ado.Birth seem high (both positive and negative). Labo.FM and Parli.F do not seem to correlate strongly with the other variables.
Next we were asked to perform a PCA and draw a biplot, first on the raw data and then on the scaled data. (Scaling non-normal distributions using means and standard deviations can be considered somewhat questionable, but let’s follow the assignment.)
pc<-princomp(human)
biplot(pc, cex=c(0.3, 1), main="Non-scaled PCA")
pc<-princomp(scale(human))
biplot(pc, cex=c(0.3, 1), main="Scaled PCA", xlab="PC1, female opportunity dimension (individual)", ylab="PC2, general female participation dimension (society level)")
barplot(pc$sdev)
summary(pc)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.0641471 1.1360379 0.87222124 0.77634649
## Proportion of Variance 0.5360463 0.1623703 0.09571374 0.07582845
## Cumulative Proportion 0.5360463 0.6984166 0.79413031 0.86995876
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.65981755 0.53457325 0.45751637 0.32119948
## Proportion of Variance 0.05477328 0.03595302 0.02633506 0.01297988
## Cumulative Proportion 0.92473204 0.96068506 0.98702012 1.00000000
The non-scaled PCA seems to be driven largely by GNI, because it has a large absolute variance compared to the other variables. Thus, some kind of scaling makes sense, giving all variables the same initial weight.
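This scaling effect is easy to demonstrate on synthetic data (illustrative column names, not the human data): without scaling, PC1 is essentially the highest-variance column.

```r
# Sketch: unscaled PCA is dominated by the high-variance variable
set.seed(1)
d <- data.frame(lowvar = rnorm(100, sd = 1), highvar = rnorm(100, sd = 100))
pc_raw <- princomp(d)
abs(pc_raw$loadings["highvar", 1]) > 0.9  # PC1 ~ the high-variance column
pc_std <- princomp(scale(d))              # scaling gives both equal weight
```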
In the scaled PCA, the x-dimension seems to gather variables related to the individual-level freedom and opportunities of women (explaining 53% of the variance). The y-axis seems to map variables indicating women’s participation at the society level (i.e. politics), explaining 16% of the variance. Together these first two PCs explain nearly 70% of all variance.
library(FactoMineR)
data(tea)
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home <fct> home, home, home, home, home, home, home, hom...
## $ work <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where <fct> chain store, chain store, chain store, chain ...
## $ price <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC <fct> middle, middle, other worker, student, employ...
## $ Sport <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....
All variables seem to be multinomial/categorical apart from age, and most are unordered. Therefore the proper way to assess their relationships is not through correlations but through e.g. Cramér’s V:
library(vcd)
tmp<-lapply(1:ncol(tea), function(i) sapply(1:ncol(tea), function(j) {assocstats(table(tea[,i], tea[,j]))$cramer}) )
kor<-as.matrix(as.data.frame(tmp))
rownames(kor)<-colnames(tea)
colnames(kor)<-colnames(tea)
corrplot(kor, order="hclust", method="square")
We can see that age seems to be related to all other variables. Also, where, price, and how form a correlated cluster, as do breakfast and frequency.
Let’s conduct a multiple correspondence analysis (MCA). The data description says: “The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”
Furthermore, “The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables.”
Thus, let’s set the last variables (from column 19 onward) as supplementary ones, with age (column 19) as a quantitative supplement.
library(factoextra)
mod1<-MCA(tea,quanti.sup=19,quali.sup=20:36, graph=F)
fviz_screeplot(mod1, addlabels = TRUE, ylim = c(0, 45))
fviz_mca_biplot(mod1,
repel = F, # Avoid text overlapping (slow if there are many points)
ggtheme = theme_minimal())
fviz_mca_var(mod1, choice = "mca.cor",
repel = TRUE, # Avoid text overlapping (slow)
ggtheme = theme_minimal())
# Contributions of variable categories to dimension 1
fviz_contrib(mod1, choice = "var", axes = 1, top = 15)
# Contributions of variable categories to dimension 2
fviz_contrib(mod1, choice = "var", axes = 2, top = 15)
# Total contributions to dimensions 1 and 2
fviz_contrib(mod1, choice = "var", axes = 1:2, top = 15)